Move the value assignment of vector x in gemv_n_sve.c to the outermos… #5420
+6
−12
Move the value assignment of vector x in gemv_n_sve.c to the outermost loop to reduce repeated data retrieval.
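As a rough illustration of the idea (not the SVE kernel touched by this PR), the scalar sketch below contrasts fetching the x-dependent value inside the innermost loop with fetching it once in the outer loop. The function names `gemv_n_reload` and `gemv_n_hoisted`, the column-major layout, and the unit strides are assumptions made only for this example.

```c
#include <stddef.h>

/* Hypothetical scalar reference for y += alpha * A * x, with A stored
 * column-major (m rows, n columns, leading dimension lda).
 * Here alpha * x[j] is fetched and recomputed for every row element. */
void gemv_n_reload(size_t m, size_t n, float alpha, const float *a,
                   size_t lda, const float *x, float *y)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            y[i] += (alpha * x[j]) * a[j * lda + i];
        }
    }
}

/* Same computation, but the value derived from x is assigned once per
 * column in the outer loop and reused across the whole row loop. */
void gemv_n_hoisted(size_t m, size_t n, float alpha, const float *a,
                    size_t lda, const float *x, float *y)
{
    for (size_t j = 0; j < n; j++) {
        const float xj = alpha * x[j];   /* loop-invariant for the i loop */
        for (size_t i = 0; i < m; i++) {
            y[i] += xj * a[j * lda + i];
        }
    }
}
```

Both functions perform the same arithmetic in the same order, so they should produce identical results. Only the number of times x is read differs.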
1. Verify correctness using BLAS-Tester, as follows:
./xsl2blastst
------------------------------- GEMV --------------------------------
TST# TR M N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
0 N 100 100 1.0 1000 1 1.0 1 0.00 7025.5 1.00 -----
0 N 100 100 1.0 1000 1 1.0 1 0.00 2479.6 0.35 PASS
1 N 200 200 1.0 1000 1 1.0 1 0.00 8852.2 1.00 -----
1 N 200 200 1.0 1000 1 1.0 1 0.00 7312.7 0.83 PASS
2 N 300 300 1.0 1000 1 1.0 1 0.00 8593.6 1.00 -----
2 N 300 300 1.0 1000 1 1.0 1 0.00 3601.1 0.42 PASS
3 N 400 400 1.0 1000 1 1.0 1 0.00 8670.0 1.00 -----
3 N 400 400 1.0 1000 1 1.0 1 0.00 11892.5 1.37 PASS
4 N 500 500 1.0 1000 1 1.0 1 0.00 10044.3 1.00 -----
4 N 500 500 1.0 1000 1 1.0 1 0.00 13902.3 1.38 PASS
5 N 600 600 1.0 1000 1 1.0 1 0.00 9877.2 1.00 -----
5 N 600 600 1.0 1000 1 1.0 1 0.00 14461.3 1.46 PASS
6 N 700 700 1.0 1000 1 1.0 1 0.00 10309.2 1.00 -----
6 N 700 700 1.0 1000 1 1.0 1 0.00 10684.0 1.04 PASS
7 N 800 800 1.0 1000 1 1.0 1 0.00 10330.9 1.00 -----
7 N 800 800 1.0 1000 1 1.0 1 0.00 13739.3 1.33 PASS
8 N 900 900 1.0 1000 1 1.0 1 0.00 11108.7 1.00 -----
8 N 900 900 1.0 1000 1 1.0 1 0.00 12660.2 1.14 PASS
9 N 1000 1000 1.0 1000 1 1.0 1 0.00 11904.7 1.00 -----
9 N 1000 1000 1.0 1000 1 1.0 1 0.00 15629.1 1.31 PASS
10 tests run, 10 passed
./xdl2blastst
------------------------------- GEMV --------------------------------
TST# TR M N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
0 N 100 100 1.0 1000 1 1.0 1 0.00 4959.1 1.00 -----
0 N 100 100 1.0 1000 1 1.0 1 0.00 1453.5 0.29 PASS
1 N 200 200 1.0 1000 1 1.0 1 0.00 4946.8 1.00 -----
1 N 200 200 1.0 1000 1 1.0 1 0.00 2587.6 0.52 PASS
2 N 300 300 1.0 1000 1 1.0 1 0.00 5179.7 1.00 -----
2 N 300 300 1.0 1000 1 1.0 1 0.00 7271.5 1.40 PASS
3 N 400 400 1.0 1000 1 1.0 1 0.00 5622.8 1.00 -----
3 N 400 400 1.0 1000 1 1.0 1 0.00 7424.6 1.32 PASS
4 N 500 500 1.0 1000 1 1.0 1 0.00 5673.6 1.00 -----
4 N 500 500 1.0 1000 1 1.0 1 0.00 7578.5 1.34 PASS
5 N 600 600 1.0 1000 1 1.0 1 0.00 5961.4 1.00 -----
5 N 600 600 1.0 1000 1 1.0 1 0.00 7932.8 1.33 PASS
6 N 700 700 1.0 1000 1 1.0 1 0.00 6213.5 1.00 -----
6 N 700 700 1.0 1000 1 1.0 1 0.00 9348.5 1.50 PASS
7 N 800 800 1.0 1000 1 1.0 1 0.00 6160.6 1.00 -----
7 N 800 800 1.0 1000 1 1.0 1 0.00 10252.0 1.66 PASS
8 N 900 900 1.0 1000 1 1.0 1 0.00 6751.3 1.00 -----
8 N 900 900 1.0 1000 1 1.0 1 0.00 10656.0 1.58 PASS
9 N 1000 1000 1.0 1000 1 1.0 1 0.00 7910.3 1.00 -----
9 N 1000 1000 1.0 1000 1 1.0 1 0.00 10597.0 1.34 PASS
10 tests run, 10 passed
2. Verify performance using the built-in benchmark: the float and double types improve by about 60% and about 40%, respectively.
before optimization:
[root@localhost benchmark]# export OMP_NUM_THREADS=1;numactl -C 10 -l ./sgemv.goto 3000 4000 100
From : 3000 To : 4000 Step = 100 Trans = 'N' Inc_x = 1 Inc_y = 1 Loops = 1
SIZE Flops
3000x3000 : 11932.54 MFlops 0.001508 sec
3100x3100 : 11471.23 MFlops 0.001675 sec
3200x3200 : 11140.85 MFlops 0.001838 sec
3300x3300 : 11119.37 MFlops 0.001959 sec
3400x3400 : 11199.25 MFlops 0.002064 sec
3500x3500 : 11424.51 MFlops 0.002145 sec
3600x3600 : 11125.72 MFlops 0.002330 sec
3700x3700 : 11432.00 MFlops 0.002395 sec
3800x3800 : 11653.88 MFlops 0.002478 sec
3900x3900 : 11696.58 MFlops 0.002601 sec
4000x4000 : 11705.83 MFlops 0.002734 sec
[root@localhost benchmark]# export OMP_NUM_THREADS=1;numactl -C 10 -l ./dgemv.goto 3000 4000 100
From : 3000 To : 4000 Step = 100 Trans = 'N' Inc_x = 1 Inc_y = 1 Loops = 1
SIZE Flops
3000x3000 : 5260.93 MFlops 0.003421 sec
3100x3100 : 5490.46 MFlops 0.003501 sec
3200x3200 : 5318.63 MFlops 0.003851 sec
3300x3300 : 5284.31 MFlops 0.004122 sec
3400x3400 : 5243.10 MFlops 0.004410 sec
3500x3500 : 5317.14 MFlops 0.004608 sec
3600x3600 : 5004.25 MFlops 0.005180 sec
3700x3700 : 5351.32 MFlops 0.005116 sec
3800x3800 : 5221.78 MFlops 0.005531 sec
3900x3900 : 5224.54 MFlops 0.005823 sec
4000x4000 : 5194.21 MFlops 0.006161 sec
after optimization:
[root@localhost benchmark]# export OMP_NUM_THREADS=1;numactl -C 10 -l ./sgemv.goto 3000 4000 100
From : 3000 To : 4000 Step = 100 Trans = 'N' Inc_x = 1 Inc_y = 1 Loops = 1
SIZE Flops
3000x3000 : 17268.24 MFlops 0.001042 sec
3100x3100 : 19730.47 MFlops 0.000974 sec
3200x3200 : 16947.36 MFlops 0.001208 sec
3300x3300 : 18414.80 MFlops 0.001183 sec
3400x3400 : 18785.26 MFlops 0.001231 sec
3500x3500 : 18939.75 MFlops 0.001294 sec
3600x3600 : 17325.09 MFlops 0.001496 sec
3700x3700 : 18647.87 MFlops 0.001468 sec
3800x3800 : 18729.12 MFlops 0.001542 sec
3900x3900 : 19344.94 MFlops 0.001573 sec
4000x4000 : 18068.97 MFlops 0.001771 sec
[root@localhost benchmark]# export OMP_NUM_THREADS=1;numactl -C 10 -l ./dgemv.goto 3000 4000 100
From : 3000 To : 4000 Step = 100 Trans = 'N' Inc_x = 1 Inc_y = 1 Loops = 1
SIZE Flops
3000x3000 : 7592.27 MFlops 0.002371 sec
3100x3100 : 7880.05 MFlops 0.002439 sec
3200x3200 : 7531.85 MFlops 0.002719 sec
3300x3300 : 7511.61 MFlops 0.002900 sec
3400x3400 : 7332.10 MFlops 0.003153 sec
3500x3500 : 7235.68 MFlops 0.003386 sec
3600x3600 : 7010.80 MFlops 0.003697 sec
3700x3700 : 7107.42 MFlops 0.003852 sec
3800x3800 : 6901.65 MFlops 0.004185 sec
3900x3900 : 6898.33 MFlops 0.004410 sec
4000x4000 : 6809.35 MFlops 0.004699 sec
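The quoted improvement figures can be cross-checked against the benchmark listings above. The sketch below copies the MFlops columns for sizes 3000 to 4000 and compares the mean throughput before and after. The helper name `mean_gain_percent` and the array names are made up for this example, and averaging over the eleven sizes is one assumed way of summarizing the runs.

```c
#include <stddef.h>

/* MFlops values copied from the benchmark output, sizes 3000..4000. */
const double sgemv_before[] = {11932.54, 11471.23, 11140.85, 11119.37,
                               11199.25, 11424.51, 11125.72, 11432.00,
                               11653.88, 11696.58, 11705.83};
const double sgemv_after[]  = {17268.24, 19730.47, 16947.36, 18414.80,
                               18785.26, 18939.75, 17325.09, 18647.87,
                               18729.12, 19344.94, 18068.97};
const double dgemv_before[] = {5260.93, 5490.46, 5318.63, 5284.31,
                               5243.10, 5317.14, 5004.25, 5351.32,
                               5221.78, 5224.54, 5194.21};
const double dgemv_after[]  = {7592.27, 7880.05, 7531.85, 7511.61,
                               7332.10, 7235.68, 7010.80, 7107.42,
                               6901.65, 6898.33, 6809.35};

/* Percentage gain of the mean of `after` over the mean of `before`.
 * Both arrays have n entries, so the ratio of sums equals the ratio of means. */
double mean_gain_percent(const double *before, const double *after, size_t n)
{
    double sum_before = 0.0;
    double sum_after = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum_before += before[i];
        sum_after += after[i];
    }
    return (sum_after / sum_before - 1.0) * 100.0;
}
```

Calling `mean_gain_percent(sgemv_before, sgemv_after, 11)` and `mean_gain_percent(dgemv_before, dgemv_after, 11)` gives roughly 61% and 38%, consistent with the "about 60%" and "about 40%" stated in item 2.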